Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g.
Reviews: Root Mean Square Layer Normalization
ORIGINALITY:
- Existing normalization techniques (batch, layer, group, instance, ...) differ mainly in the dimensions over which the activations are normalized. The proposed technique is original in that it instead removes one of the typical steps in the normalization process in order to speed up training, which has been less well studied.
- This work proposes dividing by the RMS statistic instead of the standard deviation without hurting accuracy. Other works (for example, Santurkar et al.) experiment with scaling by different statistics, such as various l_p norms, without a loss in training accuracy, so this work is not the first to suggest scaling the activations by a different statistic.

QUALITY:
- The authors tested their technique on multiple deep learning frameworks (TensorFlow, PyTorch, Theano), which lends more support to their empirical results, since different implementations can have very different timing results.
- The authors tested their technique on multiple tasks and neural network architectures.
- The main hypothesis is that the re-centering step in Layer Normalization is dispensable. This is backed only by experimental results and would be much stronger with some theoretical justification.
- While the few experimental results show that there is no degradation of accuracy from not centering the activations, I am still not fully convinced that the centering step can be deemed unnecessary. For example, it is likely that the weights and biases of the networks in the paper are initialized such that the activations are already roughly centered around zero, in which case the mean-centering step can be removed without much difference in performance.
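The reviewer's last concern rests on a simple identity: RMS^2 = variance + mean^2, so when activations are already centered, the RMS and the standard deviation coincide and dropping the centering step changes nothing. The following is a minimal NumPy sketch of that point, not an experiment from the paper; the array shapes, the shift of 2.0, and the helper names `rms` and `relative_gap` are assumptions made for illustration.

```python
import numpy as np

def rms(a):
    """Root mean square over the last axis."""
    return np.sqrt(np.mean(a ** 2, axis=-1))

def relative_gap(a):
    """(RMS - std) / std over the last axis; close to 0 when the mean is 0."""
    std = np.std(a, axis=-1)  # population std (ddof=0), the statistic LayerNorm divides by
    return (rms(a) - std) / std

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 512))                  # hypothetical pre-activations
centered = a - a.mean(axis=-1, keepdims=True)  # mean exactly removed
shifted = centered + 2.0                       # same spread, mean moved to 2

# Identity behind the concern: RMS^2 = variance + mean^2
lhs = rms(shifted) ** 2
rhs = np.var(shifted, axis=-1) + np.mean(shifted, axis=-1) ** 2

print("identity holds:", np.allclose(lhs, rhs))
print("max |gap|, centered:", np.abs(relative_gap(centered)).max())
print("min gap, shifted:   ", relative_gap(shifted).min())
```

If the gap stays near zero throughout training, the two normalizers behave almost identically, which is why the reviewer asks whether the experiments actually exercise the uncentered case.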
Reviews: Root Mean Square Layer Normalization
The authors present a new form of normalization for deep networks called RMSNorm. Because the method only requires a single pass of statistics calculations, the authors demonstrate improved training times for both machine translation and image caption retrieval while maintaining predictive accuracy. As the reviewers commented, the paper is clearly written, the results are clearly presented, and the experiments are quite thorough (different ML systems and ML architectures). In sum, the results are convincing (one reviewer upgraded their score accordingly) and usable by those who build language models and potentially other forms of deep networks that require normalization schemes. For these reasons, assuming the authors revise the paper to address all reviewer comments, this paper is accepted to this conference.
Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm.
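To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the two normalizers as described above: LayerNorm subtracts the mean and divides by the standard deviation, while RMSNorm only divides by the root mean square. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` stabilizer, and the omission of a bias term in `rms_norm` are choices made here.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-8):
    """Re-center and re-scale the summed inputs over the last axis."""
    mu = a.mean(axis=-1, keepdims=True)
    var = ((a - mu) ** 2).mean(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps) * gain + bias

def rms_norm(a, gain, eps=1e-8):
    """Re-scale only: divide by the root mean square over the last axis.

    eps is an assumed numerical stabilizer, not something stated in the abstract.
    """
    rms = np.sqrt((a ** 2).mean(axis=-1, keepdims=True) + eps)
    return a / rms * gain

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16))   # hypothetical batch of summed inputs
g = np.ones(16)                # gain parameter
b = np.zeros(16)               # bias parameter (LayerNorm only in this sketch)

# Re-scaling invariance: multiplying the inputs by a constant leaves RMSNorm unchanged.
print(np.allclose(rms_norm(3.0 * a, g), rms_norm(a, g), atol=1e-6))
# Re-centering invariance holds for LayerNorm but is given up by RMSNorm.
print(np.allclose(layer_norm(a + 5.0, g, b), layer_norm(a, g, b), atol=1e-6))
print(np.allclose(rms_norm(a + 5.0, g), rms_norm(a, g)))
```

RMSNorm skips the mean computation and the subtraction, which is the source of the claimed efficiency gain; the price is that a constant shift of the inputs now changes the output.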